18 research outputs found

    Characterization of Audiovisual Dramatic Attitudes

    In this work we explore the capability of audiovisual parameters (such as voice frequency, rhythm, head motion or facial expressions) to discriminate among different dramatic attitudes. We extract the audiovisual parameters from an acted corpus of attitudes and structure them as frame-, syllable-, and sentence-level features. Using Linear Discriminant Analysis classifiers, we show that sentence-level features achieve a higher discrimination rate among the attitudes and are less speaker-dependent than frame- and syllable-level features. We also compare the classification results with perceptual evaluation tests, showing that voice frequency correlates with the perceptual results for all attitudes, while other features, such as head motion, contribute differently depending on both the attitude and the speaker.
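
    As a rough illustration of the classification step described above, the sketch below trains a Linear Discriminant Analysis classifier on sentence-level feature vectors and scores it with leave-one-speaker-out cross-validation. The arrays X, y and speakers are placeholders, not data from the acted corpus.

```python
# Minimal sketch of attitude classification with an LDA classifier, assuming
# sentence-level features are already extracted into a (n_sentences, n_features)
# array X with attitude labels y and speaker ids `speakers` (all placeholder data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))          # placeholder sentence-level audiovisual features
y = rng.integers(0, 4, size=200)        # placeholder attitude labels (4 classes)
speakers = rng.integers(0, 2, size=200) # placeholder speaker ids

# Leave-one-speaker-out cross-validation gives a speaker-independent
# estimate of the discrimination rate.
clf = LinearDiscriminantAnalysis()
scores = cross_val_score(clf, X, y, groups=speakers, cv=LeaveOneGroupOut())
print("speaker-independent accuracy: %.2f" % scores.mean())
```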

    Audiovisual Generation of Social Attitudes from Neutral Stimuli

    The focus of this study is the generation of expressive audiovisual speech from neutral utterances for 3D virtual actors. Taking into account the segmental and suprasegmental aspects of audiovisual speech, we propose and compare several computational frameworks for the generation of expressive speech and face animation. We notably evaluate a standard frame-based conversion approach with two other methods that postulate the existence of global prosodic audiovisual patterns that are characteristic of social attitudes. The proposed approaches are tested on a database of "Exercises in Style" [1] performed by two semi-professional actors, and the results are evaluated using crowdsourced perceptual tests. The first test performs a qualitative validation of the animation platform, while the second is a comparative study of several expressive speech generation methods. We evaluate how the expressiveness of our audiovisual performances is perceived in comparison to resynthesized original utterances and to the outputs of a purely frame-based conversion system.

    Audio-Visual Speaker Conversion using Prosody Features

    The article presents a joint audio-video approach to speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using the experimental data from the BIWI 3D Audiovisual Corpus of Affective Communication, mapping functions are built between each pair of speakers in order to convert the speaker-specific features: the speech signal and the 3D facial expressions. The results obtained by combining audio and visual features are compared to corresponding results from earlier approaches, highlighting the improvements brought by introducing dynamic features and exploiting prosodic features.
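
    The mapping functions mentioned above follow the statistical voice-conversion tradition; a common formulation fits a Gaussian mixture on joint source/target feature vectors and converts each source frame to the conditional expectation of the target features. The sketch below illustrates that generic formulation on placeholder data; it is not the paper's exact model, which also adds dynamic and prosodic features.

```python
# A minimal sketch of GMM-based feature mapping in the spirit of statistical
# voice conversion: fit a GMM on joint source/target vectors, then convert a
# source frame to the conditional expectation of the target features.
# Array names and dimensions are illustrative, not taken from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

d = 10                                     # feature dimension per speaker (placeholder)
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, d))           # aligned source-speaker features
tgt = src @ rng.normal(size=(d, d)) * 0.5  # aligned target-speaker features (synthetic)

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.hstack([src, tgt]))             # joint density p(x, y)

def convert(x):
    """Map one source frame x (shape (d,)) to E[y | x] under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d:, :d]
    # Responsibilities of each mixture component given x alone.
    diff = x - mu_x
    logp = np.array([
        -0.5 * diff[m] @ np.linalg.solve(S_xx[m], diff[m])
        - 0.5 * np.linalg.slogdet(S_xx[m])[1]
        + np.log(gmm.weights_[m])
        for m in range(gmm.n_components)
    ])
    w = np.exp(logp - logp.max()); w /= w.sum()
    # Conditional mean of y given x for each component, weighted by w.
    cond = np.array([
        mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], diff[m])
        for m in range(gmm.n_components)
    ])
    return w @ cond

print(convert(src[0]).shape)               # -> (d,)
```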

    Figurines, a multimodal framework for tangible storytelling

    This paper presents Figurines, an offline framework for narrative creation with tangible objects, designed to record storytelling sessions with children, teenagers or adults. The framework uses tangible diegetic objects to record a free narrative from up to two storytellers and to construct a fully annotated representation of the story. This representation is composed of the 3D position and orientation of the figurines, the position of the decor elements and an interpretation of the storytellers' actions (facial expressions, gestures and voice). While maintaining the playful dimension of the storytelling session, the system must tackle the challenge of recovering the free-form motion of the figurines and the storytellers in uncontrolled environments. To do so, we record the storytelling session using a hybrid setup with two RGB-D sensors and figurines augmented with IMU sensors. The first RGB-D sensor complements the IMU information to identify and track the figurines as well as the decor elements; it also tracks the storytellers jointly with the second RGB-D sensor. The framework has been used to record preliminary experiments that validate the interest of our approach. These experiments evaluate figurine tracking and the combination of motion with the storytellers' voice, gestures and facial expressions. In a make-believe game, this story representation was re-targeted onto virtual characters to produce an animated version of the story. The final goal of the Figurines framework is to enhance our understanding of the creative processes at work during immersive storytelling.
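
    The hybrid RGB-D/IMU setup suggests a simple per-figurine fusion policy: take position from the camera when the figurine is visible and orientation from its IMU at every frame. The sketch below illustrates only that idea; the names and the fallback policy are illustrative and not the paper's tracking algorithm.

```python
# A minimal sketch of the kind of sensor fusion implied above: the RGB-D camera
# provides a (possibly missing) 3D position for a figurine, while its IMU
# provides orientation; a tiny data class combines both per frame, holding the
# last known position when the camera loses the figurine.
# All names and the fusion policy are illustrative, not the paper's algorithm.
from dataclasses import dataclass
from typing import Optional, Tuple

Quaternion = Tuple[float, float, float, float]
Vec3 = Tuple[float, float, float]

@dataclass
class FigurineState:
    position: Vec3            # from the RGB-D tracker
    orientation: Quaternion   # from the on-board IMU
    visible: bool             # whether the RGB-D tracker currently sees it

def update(state: FigurineState,
           rgbd_position: Optional[Vec3],
           imu_orientation: Quaternion) -> FigurineState:
    """One tracking step: trust the camera for position when available,
    otherwise keep the last position; always take orientation from the IMU."""
    if rgbd_position is not None:
        return FigurineState(rgbd_position, imu_orientation, True)
    return FigurineState(state.position, imu_orientation, False)

s = FigurineState((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), True)
s = update(s, None, (0.924, 0.0, 0.383, 0.0))   # occluded frame: IMU still updates
print(s.visible, s.orientation)
```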

    Reactive Statistical Mapping: Towards the Sketching of Performative Control with Data

    This paper presents the results of our participation in the ninth eNTERFACE workshop on multimodal user interfaces. Our goal for this workshop was to bring some technologies currently used in speech recognition and synthesis to a new level, i.e. to make them the core of a new HMM-based mapping system. We investigated the idea of statistical mapping, more precisely how to use Gaussian Mixture Models and Hidden Markov Models for realtime, reactive generation of new trajectories from input labels and for realtime regression in a continuous-to-continuous use case. As a result, we have developed several proofs of concept, including an incremental speech synthesiser, software for exploring stylistic spaces for gait and facial motion in realtime, a reactive audiovisual laughter synthesiser and a prototype demonstrating the realtime reconstruction of lower-body gait motion strictly from upper-body motion, while preserving its stylistic properties. This project was also the opportunity to formalise HMM-based mapping, integrate several of these innovations into the Mage library and explore the development of a realtime gesture recognition tool.
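
    To give a flavour of label-driven generation with HMMs, the sketch below trains one Gaussian HMM per style label on placeholder motion trajectories and produces a new trajectory by sampling the model of a requested label. HTS/Mage-style systems instead use maximum-likelihood parameter generation with dynamic features, so this is only a simplified stand-in.

```python
# A minimal sketch of label-driven trajectory generation with HMMs, loosely
# inspired by the statistical-mapping idea above: one HMM per style label,
# new trajectories produced by sampling the model of the requested label.
# Training data, labels and dimensions are placeholders.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def train_style_model(trajectories, n_states=5):
    """Fit one Gaussian HMM on a list of (T_i, D) motion trajectories."""
    X = np.vstack(trajectories)
    lengths = [len(t) for t in trajectories]
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=25, random_state=0)
    model.fit(X, lengths)
    return model

# Placeholder training data: 20 short trajectories per style, 6-D features.
styles = {
    label: train_style_model([rng.normal(loc=shift, size=(40, 6))
                              for _ in range(20)])
    for label, shift in {"neutral": 0.0, "tender": 1.0}.items()
}

# "Reactive" generation: given a requested label, emit a new trajectory.
traj, _states = styles["tender"].sample(40)
print(traj.shape)  # -> (40, 6)
```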

    A system for creating virtual reality content from make-believe games

    Pretend play is a storytelling technique, naturally used from a very young age, which relies on object substitution to represent the characters of the imagined story. We propose a system that assists the storyteller by generating a virtualized story from a recorded dialogue performed with 3D-printed figurines. We capture the gestures and facial expressions of the storyteller using Kinect cameras and IMU sensors and transfer them to their virtual counterparts in the story-world. As a proof of concept, we demonstrate our system with an improvised story involving a prince and a witch, which was successfully recorded and transferred into a 3D animation.

    3D Human Pose Estimation from Monocular Image Sequences


    Generation of audio-visual prosody for expressive virtual actors

    The work presented in this thesis addresses the problem of generating expressive audio-visual performances for virtual actors. A virtual actor is represented by a 3D talking head, and an audio-visual performance refers to facial expressions, head movements, gaze direction and the speech signal. While a considerable amount of work has been dedicated to emotions, we explore here the expressive verbal behaviors that signal mental states, i.e. "how speakers feel about what they say". We study the characteristics of these so-called dramatic attitudes and the way they are encoded with speaker-specific prosodic signatures, i.e. mental-state-specific patterns of trajectories of audio-visual prosodic parameters.